Authors
Affiliation

Alessandro Pizzi

University of Lausanne

Andrea Lovato

Ayman El Abed

Illia Dorofieiev

Published

December 16, 2024

Abstract

This study investigates the behavioral, dietary, and lifestyle determinants of obesity in Mexico, Peru, and Colombia, with the aim of identifying significant predictors and establishing a robust framework for obesity risk prediction. Logistic regression models were applied, converting obesity levels into a binary classification to align with the requirements of the statistical methodology. The stepwise logistic regression model, optimized using the Akaike Information Criterion (AIC), emerged as the most effective approach, retaining the predictors most strongly associated with obesity while ensuring model interpretability. The results underscore the critical influence of family history, frequent consumption of high-calorie foods, and lifestyle factors such as physical activity and transportation modes on obesity risk. The model showed good predictive performance, with an overall accuracy of 75.89% and an area under the curve (AUC) of 0.856, indicating good discriminative capability. Moreover, the analysis revealed nuanced patterns, including the contrasting effects of snacking frequency and the complex interplay between transportation choices and physical activity, providing a deeper understanding of the multifactorial drivers of obesity. While the model excels in specificity and offers practical utility for public health interventions, its moderate sensitivity and the exclusion of broader environmental and psychosocial variables limit its scope. Additionally, the reliance on synthetic data and a non-representative sample constrains direct real-world applicability. However, this study offers an opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges.

1 Introduction

1.1 Project Goals

Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.

This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.

While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.

1.2 Research Questions

  • Question 1

    What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?

  • Question 2

    Can we predict whether a person is obese from a given combination of these factors?

  • Question 3

    How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?

2 Data

2.1 Sources

The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.
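
To make the oversampling step more concrete, the following is a minimal base-R sketch of the interpolation idea behind SMOTE. It is not the Weka implementation used by the dataset's authors: the function name `smote_one`, the toy matrix `minority`, and the choice `k = 2` are illustrative assumptions.

```r
# Minimal sketch of SMOTE's core idea (illustrative only, not the Weka
# implementation): a synthetic point is placed at a random position on the
# segment joining a minority-class sample and one of its k nearest neighbours.
smote_one <- function(X, i, k = 2) {
  # Euclidean distances from sample i to every sample in X
  d <- sqrt(rowSums((X - matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE))^2))
  d[i] <- Inf                         # exclude the sample itself
  neighbours <- order(d)[seq_len(k)]  # indices of the k nearest neighbours
  j <- neighbours[sample.int(k, 1)]   # pick one neighbour at random
  gap <- runif(1)                     # interpolation weight between 0 and 1
  X[i, ] + gap * (X[j, ] - X[i, ])
}

set.seed(1)
# Hypothetical minority-class samples: columns are height (m) and weight (kg)
minority <- rbind(c(1.60, 45), c(1.65, 48), c(1.70, 50), c(1.75, 52))
synthetic <- smote_one(minority, i = 1, k = 2)
synthetic
```

Because every synthetic row is an interpolation between two observed rows, it always falls inside the range already covered by the real samples, which is one reason synthetic data cannot add variability that the survey did not capture.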

2.2 Description

The dataset consists of 2111 records and 17 attributes, offering a detailed view of the factors associated with obesity. The attributes are a mix of categorical and continuous variables covering demographic, lifestyle, and behavioral factors. The first table below shows the first six rows of the raw data, and the second summarizes the name, type, and meaning of each variable.

Code
library(here)
library(knitr)
library(kableExtra)  # provides kbl(), kable_styling(), row_spec(), scroll_box() and re-exports %>%
# Main features of the dataset
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
head(dataset_raw) %>%
  kbl(format = "html", caption = "First 6 Rows of the Raw Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
First 6 Rows of the Raw Dataset
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 no Public_Transportation Normal_Weight
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight
Code
val_meaning <- c("Indicates the gender of the individual (Male/Female).", "Represents the age of participants in years.", "The height of individuals in meters.", "The weight of participants in kilograms.", "Indicates whether a family member has suffered from overweight (Yes/No).", "Indicates if participants frequently consume high-caloric foods (Yes/No).", "Scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).", "Indicates the typical number of main meals consumed daily.", "Describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).", "Indicates whether participants smoke (Yes/No).", "Scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).", "Whether participants monitor their calorie intake (Yes/No).", "Scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).", "Reflects daily time spent on technological devices, in hours.", "Indicates the frequency of alcohol consumption (e.g., I don't drink, Sometimes, Frequently, Always).", "Describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).", "The target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).")
desc_table <- tibble::tibble(Name = colnames(dataset_raw), Type = sapply(dataset_raw, class), Meaning = val_meaning)
desc_table %>%
  kbl(format = "html", caption = "Variable Descriptions") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  column_spec(1, bold = TRUE, width = "200px") %>%
  column_spec(2, width = "150px") %>%
  column_spec(3, width = "500px") %>%
  scroll_box(width = "100%", height = "500px")
Variable Descriptions
Name Type Meaning
Gender character Indicates the gender of the individual (Male/Female).
Age numeric Represents the age of participants in years.
Height numeric The height of individuals in meters.
Weight numeric The weight of participants in kilograms.
family_history_with_overweight character Indicates whether a family member has suffered from overweight (Yes/No).
FAVC character Indicates if participants frequently consume high-caloric foods (Yes/No).
FCVC numeric Scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).
NCP numeric Indicates the typical number of main meals consumed daily.
CAEC character Describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).
SMOKE character Indicates whether participants smoke (Yes/No).
CH2O numeric Scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).
SCC character Whether participants monitor their calorie intake (Yes/No).
FAF numeric Scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).
TUE numeric Reflects daily time spent on technological devices, in hours.
CALC character Indicates the frequency of alcohol consumption (e.g., I don't drink, Sometimes, Frequently, Always).
MTRANS character Describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).
NObeyesdad character The target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).

Prior to its publication, the dataset underwent a detailed preprocessing phase, including normalization of continuous variables, encoding of categorical data, and removal of missing or atypical entries. Class imbalance was addressed with SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic records for under-represented classes. The final dataset comprises 77% synthetic data, which improves class balance, and 23% real-world data, which adds authenticity. This combination supports a broad analysis of obesity-related factors, provided that potential biases, such as inaccuracies in self-reported information, are kept in mind.

2.3 Wrangling

The packages required for data manipulation, visualization, and clustering are loaded first. Each serves a specific purpose:

  • dplyr: for data manipulation (e.g., filtering, summarizing);

  • tidyr: for data tidying (e.g., reshaping);

  • ggplot2: for visualization;

  • corrplot: for correlation matrix visualization;

  • ggridges: for creating ridge plots;

  • cluster: for clustering algorithms;

  • reshape2: for data reshaping, especially during visualization.

Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(ggridges)
library(cluster)
library(reshape2)

Column names are renamed to enhance clarity and improve usability during the analysis. The updated names are designed to be shorter and more intuitive, ensuring ease of reference while retaining their original meaning and context. This adjustment simplifies code readability and helps streamline data manipulation tasks, particularly in complex analytical workflows.

Code
dataset <- dataset_raw %>%
  rename(
    family_hist = family_history_with_overweight,
    obesity_lev = NObeyesdad,
    caloric_food = FAVC,
    vegetable_food = FCVC,
    nb_meal_day = NCP,
    food_btw_meals = CAEC,
    ch2o = CH2O,
    smoke = SMOKE,
    calorie_check = SCC,
    physical_act = FAF,
    freq_alcohol = CALC,
    use_tech = TUE,
    m_trans = MTRANS,
    gender = Gender,
    age = Age,
    weight = Weight,
    height = Height
  )

The structure of the dataset is examined to identify the data types of each variable, providing critical insights for subsequent data preparation. Understanding the data types helps pinpoint columns requiring transformations, such as converting categorical variables to factors or standardizing numeric variables for analysis.

Code
str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
str_table %>%
  kbl(format = "html", caption = "Original structure of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
Original structure of the Dataset
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : chr "Female" "Female" "Male" "Male" ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : chr "yes" "yes" "yes" "no" ...
$ caloric_food : chr "no" "no" "no" "no" ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
$ smoke : chr "no" "yes" "no" "no" ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : chr "no" "yes" "no" "no" ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : chr "no" "Sometimes" "Frequently" "Frequently" ...
$ m_trans : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
$ obesity_lev : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
Code
dataset <- dataset %>%
  mutate(
    gender = as.factor(gender),
    family_hist = as.factor(family_hist),
    caloric_food = as.factor(caloric_food),
    smoke = as.factor(smoke),
    calorie_check = as.factor(calorie_check),
    m_trans = as.factor(m_trans),
    obesity_lev = factor(obesity_lev, 
                         levels = c("Insufficient_Weight", "Normal_Weight", 
                                    "Overweight_Level_I", "Overweight_Level_II", 
                                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"), 
                         ordered = TRUE),
    food_btw_meals = factor(ifelse(food_btw_meals == "no", "No", food_btw_meals), 
                            levels = c("No", "Sometimes", "Frequently", "Always"), 
                            ordered = TRUE),
    freq_alcohol = factor(ifelse(freq_alcohol == "no", "No", freq_alcohol), 
                          levels = c("No", "Sometimes", "Frequently", "Always"), 
                          ordered = TRUE))


str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)
str_table %>%
  kbl(format = "html", caption = "Manipulated Dataset Structure") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Manipulated Dataset Structure
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
$ caloric_food : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: Ord.factor w/ 4 levels "No"<"Sometimes"<..: 2 2 2 2 2 2 2 2 2 2 ...
$ smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : Ord.factor w/ 4 levels "No"<"Sometimes"<..: 1 2 3 3 2 2 2 2 3 1 ...
$ m_trans : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
$ obesity_lev : Ord.factor w/ 7 levels "Insufficient_Weight"<..: 2 2 2 3 4 2 2 2 2 2 ...

The transformations ensured the dataset was ready for analysis by restructuring categorical and ordinal variables to meet modeling requirements. Converting categorical variables into factors standardized their representation, reducing ambiguity and improving compatibility with statistical models. For ordinal variables, levels were explicitly ordered to preserve their logical progression and enhance interpretability, allowing for meaningful comparisons across categories.

The updated structure was reviewed to confirm the accuracy of these adjustments, providing confidence in the preprocessing steps. While further transformations like normalization were not applied, the focus on categorical and ordinal adjustments established a strong foundation for reliable and interpretable analysis. In particular, the levels of obesity categories, food consumption between meals, and frequency of alcohol use were arranged to reflect increasing severity or frequency, ensuring these variables captured their intended relationships and supported clear, accurate insights into the data.
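
As a small illustration of what explicit ordering provides, the sketch below builds an ordered factor with the same labels as “food_btw_meals”. The vector `snack` and its values are made up for the example and are not taken from the dataset.

```r
# Toy vector using the same labels as food_btw_meals (values are made up)
snack <- factor(c("Sometimes", "No", "Always", "Frequently", "Sometimes"),
                levels = c("No", "Sometimes", "Frequently", "Always"),
                ordered = TRUE)

snack >= "Frequently"   # comparisons follow the declared level order
as.integer(snack)       # integer codes follow the level order: No = 1, ..., Always = 4
max(snack)              # the highest level present in the vector

# Without explicit levels, R sorts the labels alphabetically instead
levels(factor(c("Sometimes", "No", "Always", "Frequently")))
```

Declaring the levels explicitly therefore matters: with the default alphabetical order, “Always” would be treated as the lowest category.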

Next, a numerical version of the dataset, called “dataset_num”, is created by converting the categorical variables to numeric codes. This step is required for the correlation matrix, which accepts only numeric variables. Ordinal variables keep their level order in the codes; for nominal variables such as “m_trans”, the codes are arbitrary labels and carry no order.

The transformed dataset is then checked for missing values, and the counts are tabulated to confirm that the conversion introduced none.

Code
# Recode the ordinal and binary variables in a single pipeline, so that every
# recoding is kept in dataset_num.
dataset_num <- dataset %>%
  mutate(
    # Ordinal target: 1 (Insufficient_Weight) to 7 (Obesity_Type_III)
    obesity_lev = recode(obesity_lev,
                         "Insufficient_Weight" = 1,
                         "Normal_Weight" = 2,
                         "Overweight_Level_I" = 3,
                         "Overweight_Level_II" = 4,
                         "Obesity_Type_I" = 5,
                         "Obesity_Type_II" = 6,
                         "Obesity_Type_III" = 7),
    # Ordinal alcohol frequency: 1 (No) to 4 (Always)
    freq_alcohol = recode(freq_alcohol,
                          "No" = 1, "Sometimes" = 2, "Frequently" = 3, "Always" = 4),
    # Nominal transport mode: the codes are labels only and imply no order
    m_trans = recode(m_trans,
                     "Automobile" = 1, "Bike" = 2, "Motorbike" = 3,
                     "Public_Transportation" = 4, "Walking" = 5),
    # Ordinal snacking frequency: 0 (No) to 3 (Always)
    food_btw_meals = recode(food_btw_meals,
                            "No" = 0, "Sometimes" = 1, "Frequently" = 2, "Always" = 3),
    # Binary calorie monitoring: 0 = no, 1 = yes
    calorie_check = recode(calorie_check, "no" = 0, "yes" = 1)
  ) %>%
  # Remaining factors (gender, family_hist, caloric_food, smoke) become level codes 1 and 2
  mutate(across(where(is.factor), ~ as.numeric(.)))


str_output <- capture.output(str(dataset_num))
table_num_str <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

table_num_str %>%
  kbl(format = "html", caption = "Structure of the Numerical Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Structure of the Numerical Dataset
Structure
'data.frame': 2111 obs. of 17 variables:
$ gender : num 1 1 2 2 2 2 1 2 2 2 ...
$ age : num 21 21 23 27 22 29 23 22 24 22 ...
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_hist : num 2 2 2 1 1 1 2 1 2 2 ...
$ caloric_food : num 1 1 1 1 1 2 2 1 2 2 ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: num 1 1 1 1 1 1 1 1 1 1 ...
$ smoke : num 1 2 1 1 1 1 1 1 1 1 ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : num 0 1 0 0 0 0 0 0 0 0 ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : num 1 2 3 3 2 2 2 2 3 1 ...
$ m_trans : num 4 4 4 5 4 1 3 4 4 4 ...
$ obesity_lev : num 2 2 2 3 4 2 2 2 2 2 ...
Code
nb_na<- colSums(is.na(dataset_num))
nb_na %>%
  kbl(format = "html", caption = "Presence of Potential NA Values in the Dataset") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), 
    full_width = FALSE, 
    position = "left"
  ) %>%
  column_spec(1, width = "100px") %>%
  column_spec(2, width = "80px") %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Presence of Potential NA Values in the Dataset
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

The test results confirmed the absence of any NA values in the dataset, indicating that all variables were successfully converted to numeric format without compromising data integrity.
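
The structure above shows binary variables such as “family_hist” coded as 1 and 2 rather than 0 and 1. The sketch below, using made-up toy vectors (`fh` and `mt` are hypothetical names), illustrates why: `as.numeric()` applied to a factor returns its internal level codes, and levels are sorted alphabetically unless specified otherwise.

```r
# Toy factor mimicking family_hist (values are made up for illustration)
fh <- factor(c("yes", "no", "yes", "yes"))
levels(fh)        # levels are sorted alphabetically: "no" then "yes"
as.numeric(fh)    # level codes, so "no" -> 1 and "yes" -> 2

# An explicit 0/1 indicator, if that coding is preferred
as.integer(fh == "yes")

# For a nominal factor the codes only reflect alphabetical order
mt <- factor(c("Walking", "Automobile", "Bike"))
as.numeric(mt)    # Automobile = 1, Bike = 2, Walking = 3
```

A shift from 0/1 to 1/2 does not change a correlation coefficient, but codes assigned to nominal categories have no quantitative meaning, so correlations involving such variables should be read with caution.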

2.4 Spotting Mistakes and Missing Data

Check for missing values

To ensure data integrity, missing values in the dataset are examined by counting “NA” values in each column, providing a clear view of dataset completeness. The results are presented in a formatted table for easy interpretation, with styling applied for readability and a scrollable box to handle larger datasets. This process facilitates prompt handling of missing data through appropriate strategies.

Code
missing_values <- colSums(is.na(dataset))
missing_values %>%
  kbl(format = "html", caption = "Missing Values in Each Column") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width=FALSE, position = "center") %>%
  column_spec(1, width = "100px") %>%
  column_spec(2, width = "80px") %>%
  row_spec(0, bold = TRUE, background = "#f0f0f0") %>%
  scroll_box(width = "100%", height = "400px")
Missing Values in Each Column
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

The analysis confirms that all columns contain complete data, with no missing values identified. This completeness ensures a robust foundation for subsequent analysis, eliminating the need for immediate data cleaning related to missing entries.

Check for duplicates

The dataset is examined for duplicated rows to ensure data integrity and eliminate redundancy. Identifying and addressing duplicates is a crucial step in data preprocessing, as redundant entries can skew analysis results and lead to misleading conclusions. This process involves systematically scanning the dataset for identical rows and quantifying their occurrence.

Code
duplicated_rows <- sum(duplicated(dataset))
duplicated_rows
[1] 24

The detection of 24 duplicated rows in the dataset highlights the need for further preprocessing to ensure data integrity, as these redundant entries could skew analysis if not properly handled.

Code
dataset <- dataset %>%
  distinct()

nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE

The dataset was refined by removing duplicate entries to ensure that only unique rows are retained. A verification step confirmed that no duplicates remain, ensuring the dataset’s integrity and reliability for further analysis.
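
The row counts are consistent with the removal: 2111 - 24 = 2087. As a minimal illustration of the mechanics, the sketch below applies the same check to a made-up three-row data frame (`toy` is a hypothetical name), using base R only.

```r
# Toy data frame with one repeated row (values are made up)
toy <- data.frame(age = c(21, 21, 23), weight = c(64, 64, 77))

duplicated(toy)                         # flags the second and later copies of a row
sum(duplicated(toy))                    # number of redundant rows
toy_unique <- toy[!duplicated(toy), ]   # base-R counterpart of dplyr::distinct()
nrow(toy_unique)
```

Note that `duplicated()` marks only the repeats, not the first occurrence, so one copy of each duplicated row is always retained.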

2.5 Listing Anomalies and Outliers

A bar chart was created to visualize the distribution of obesity levels, providing a clear overview of class frequencies within the dataset. Particular attention is given to obesity levels, as this variable serves as the dependent variable in the predictive model to be developed later.

Code
library(plotly)  # provides ggplotly()

g1 <- ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability

plotly_plot <- ggplotly(g1)
plotly_plot

The chart highlights a balanced distribution across all obesity levels, demonstrating the effectiveness of SMOTE in addressing class imbalance. By equalizing the representation of each category, the dataset becomes more reliable for analysis, reducing biases and ensuring a fair evaluation of patterns within the data. On the other hand, the synthetic data introduced by SMOTE may not fully reflect real-world variability, potentially leading to artificial patterns that could affect the interpretability of results.
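
Class balance can also be quantified rather than read off the chart. The sketch below is one possible way to do so; the vector `class_labels` and its counts are made up for illustration, and in the report the same calls would be applied to the “obesity_lev” column.

```r
# Toy labels standing in for obesity_lev (counts are made up)
class_labels <- factor(rep(c("Normal_Weight", "Overweight_Level_I", "Obesity_Type_I"),
                           times = c(30, 28, 32)))
counts <- table(class_labels)
shares <- round(100 * prop.table(counts), 1)   # percentage of rows per class
shares
imbalance_ratio <- max(counts) / min(counts)   # 1 means perfectly balanced
imbalance_ratio
```

A ratio close to 1 indicates that no class dominates, which is the property SMOTE is intended to produce.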

A density plot was generated to visualize the age distribution across different obesity levels, providing insights into patterns and trends within the data.

Code
g2 <- ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level"
  ) +
  xlim(14, 50)
plotly_plot1 <- ggplotly(g2)
plotly_plot1

This graph shows the age distribution within each obesity level and gives an indication of how the SMOTE-generated records are distributed. The distributions differ visibly between categories: younger ages are predominantly associated with the lower levels (e.g., Insufficient Weight and Normal Weight), while older ages are more common in categories such as Overweight Level II and Obesity Type II. Obesity Type III departs from this trend, with ages concentrated in a narrow band in the low to mid twenties.

Notably, sharp peaks in the density curves, such as the one around age 30 in “Obesity Type I,” could indicate potential artifacts introduced during the synthetic data generation process. While these patterns align with logical demographic trends, they highlight the need for further validation to ensure that such separations and peaks represent realistic population characteristics rather than biases from data augmentation. Overall, the dataset reflects clear and interpretable patterns, but these observations suggest the importance of cautious interpretation and robust validation in subsequent analyses.

Summary statistics were computed for key variables across obesity levels to identify potential anomalies or patterns, providing a clearer understanding of how age, height, and weight vary within each category.

Code
dataset_stat <- dataset %>%
  group_by(obesity_lev) %>%
  summarize(
    Age_Mean = mean(age, na.rm = TRUE),
    Age_SD = sd(age, na.rm = TRUE),
    Height_Mean = mean(height, na.rm = TRUE),
    Height_SD = sd(height, na.rm = TRUE),
    Weight_Mean = mean(weight, na.rm = TRUE),
    Weight_SD = sd(weight, na.rm = TRUE)
  )
dataset_stat %>%
  kbl(format = "html", caption = "Summary Statistics by Obesity Level", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Summary Statistics by Obesity Level
obesity_lev Age_Mean Age_SD Height_Mean Height_SD Weight_Mean Weight_SD
Insufficient_Weight 19.8 2.7 1.7 0.1 50.0 6.0
Normal_Weight 21.8 5.1 1.7 0.1 62.2 9.3
Overweight_Level_I 23.5 6.3 1.7 0.1 74.5 8.6
Overweight_Level_II 27.0 8.1 1.7 0.1 82.1 8.5
Obesity_Type_I 25.9 7.8 1.7 0.1 92.9 11.5
Obesity_Type_II 28.2 4.9 1.8 0.1 115.3 8.0
Obesity_Type_III 23.5 2.8 1.7 0.1 120.9 15.5

The summary statistics reveal clear differences across obesity levels. As expected, mean weight rises steadily with each category, from 50.0 kg (Insufficient Weight) to 120.9 kg (Obesity Type III); its standard deviation does not follow a monotonic pattern but is largest in Obesity Type III (15.5 kg). Mean height is nearly constant across categories (1.7 to 1.8 m), so differences between levels are driven mainly by weight. Mean age increases from 19.8 years (Insufficient Weight) to 28.2 years (Obesity Type II), with the widest spread in Overweight Level II and Obesity Type I (standard deviations of 8.1 and 7.8 years). Obesity Type III is an exception: its mean age is 23.5 years with a standard deviation of only 2.8 years, a narrow concentration that may partly reflect the synthetic data. These trends are broadly consistent with expectations and support the dataset’s structure, while the age patterns warrant further analysis.
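
A rough check on these group means is to convert them to body mass index (BMI = weight in kg divided by squared height in m). The sketch below does this and compares the result with the WHO adult BMI cut-offs. Two assumptions apply: this section does not state that the labels were derived from those cut-offs, and the BMI of rounded group means is only an approximation of each group's mean BMI.

```r
# Group means copied from the summary table (rounded to one decimal there)
height_m  <- c(1.7, 1.7, 1.7, 1.7, 1.7, 1.8, 1.7)
weight_kg <- c(50.0, 62.2, 74.5, 82.1, 92.9, 115.3, 120.9)
level <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
           "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
           "Obesity_Type_III")

# Body mass index: weight in kilograms divided by squared height in metres
bmi <- weight_kg / height_m^2

# WHO adult cut-offs (assumed here, not stated in this section)
who_class <- cut(bmi, breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf), right = FALSE,
                 labels = c("Underweight", "Normal", "Pre-obesity",
                            "Obesity I", "Obesity II", "Obesity III"))
data.frame(level, bmi = round(bmi, 1), who_class)
```

Under these assumptions the approximate BMI increases with every level, which illustrates why weight separates the categories so clearly while height contributes little on its own.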

Clustering was performed using k-means to explore the dataset’s structure and assess the coherence of the groups, with the silhouette score calculated to evaluate the quality and separation of the clusters.

Code
library(cluster)
set.seed(123)
kmeans_res <- kmeans(select(dataset, where(is.numeric)), centers = length(unique(dataset$obesity_lev)))
silhouette_score <- silhouette(kmeans_res$cluster, dist(select(dataset, where(is.numeric))))
mean_silhouette_score <- mean(silhouette_score[, "sil_width"])
mean_silhouette_score
[1] 0.4513519

The mean silhouette score of approximately 0.451 indicates moderate cohesion within clusters and reasonable separation between them. The clusters were obtained by k-means with the number of centers set to the number of obesity levels (seven), so they are comparable in number to those levels but are not guaranteed to coincide with them. The score suggests that the groups in the numeric variables are distinguishable but not sharply isolated. This result is compatible with SMOTE having balanced the dataset without introducing major distortions, although the silhouette score alone cannot confirm this. The findings support the dataset’s suitability for clustering-based exploration while highlighting the importance of further validation to ensure the robustness of the observed patterns.
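
For reference, the silhouette width of an observation is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes the mean silhouette on made-up data. Unlike the analysis above, it standardizes the columns with `scale()` before clustering, which is shown here as a possible variant and not as the method used in this report; the object names `toy`, `toy_scaled`, and `km` are hypothetical.

```r
library(cluster)   # provides silhouette()

set.seed(123)
# Two made-up groups on very different scales: age (years) and weight (kg)
toy <- rbind(
  cbind(age = rnorm(50, 22, 2), weight = rnorm(50, 55, 5)),
  cbind(age = rnorm(50, 30, 2), weight = rnorm(50, 110, 5))
)

toy_scaled <- scale(toy)                        # centre and scale each column
km <- kmeans(toy_scaled, centers = 2, nstart = 10)
sil <- silhouette(km$cluster, dist(toy_scaled))
mean_sil <- mean(sil[, "sil_width"])            # mean silhouette width, between -1 and 1
mean_sil
```

Standardizing prevents a variable with a large numeric range, such as weight in kilograms, from dominating the Euclidean distances on which both k-means and the silhouette rely.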

2.6 Correlation Analysis

To explore relationships among variables and their association with obesity levels, a correlation matrix was computed. The analysis focuses on identifying the strength and direction of correlations between “obesity_lev” (the dependent variable) and other predictors, such as physical activity, frequency of alcohol consumption, and dietary habits. By ordering variables based on their correlation with “obesity_lev”, the matrix highlights the most influential factors in determining obesity levels. A heatmap visualization was then created to provide an intuitive representation of these relationships, with a gradient color scale indicating the strength of positive and negative correlations. This approach facilitates the identification of key variables for further analysis and modeling.

Code
#Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>% select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight","height", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech", "ch2o", "m_trans", "smoke","nb_meal_day", "calorie_check", "gender"),use = "complete.obs")

#Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

#Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

#Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

#Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

g3 <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) + 
    geom_tile() + 
    geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
    labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        axis.text.y = element_text(angle = 45, vjust = 1))

plot3 <- ggplotly(g3)
plot3
Code
# Create the heatmap with correlation values

# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>% select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech","ch2o", "height", "calorie_check", "gender"), use = "complete.obs")

# Extract the correlations with "obesity_lev"
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

g4 <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        axis.text.y = element_text(angle = 45, vjust = 1) 
  )
plot4 <- ggplotly(g4)
plot4

The correlation matrices provide valuable insights into the relationships between variables and their association with obesity levels. As expected, weight exhibits a very strong positive correlation with obesity level, reinforcing its central role in defining the target variable. Family history of obesity and caloric food consumption also show moderate positive correlations, highlighting their relevance as predictive factors.

Conversely, variables such as physical activity and food consumption between meals exhibit weak or negative correlations, indicating that their linear association with obesity level is either small or runs in the opposite direction. These patterns align with logical trends but also underscore the need for careful consideration of multicollinearity and the relative importance of variables in predictive modeling. Ordering the heatmap by correlation with obesity level makes the most strongly associated factors easy to identify for further analysis. Overall, the results indicate that the dataset’s structure supports an examination of the factors influencing obesity.
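One common way to quantify the multicollinearity concern raised above is the variance inflation factor (VIF). The sketch below computes it in base R on simulated data; the variables and their relationship are assumptions made for illustration and do not reproduce the study's dataset.

```r
set.seed(1)
n <- 500
# Simulated predictors (hypothetical): weight is built from height to induce collinearity
height <- rnorm(n, 1.70, 0.09)
weight <- 60 * height - 15 + rnorm(n, 0, 8)
age    <- rnorm(n, 24, 6)
X <- data.frame(height, weight, age)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on all others
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

Values close to 1 indicate little redundancy with the other predictors, while larger values indicate that a predictor is partly explained by the others.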

3 Exploratory Data Analysis (EDA)

The Exploratory Data Analysis (EDA) phase of the project was designed to uncover meaningful patterns and insights while ensuring the dataset was optimized for analysis. A correlation heatmap was employed early in the process to identify and assess relationships between variables. By comparing the initial and refined versions of the heatmap, we effectively filtered out less relevant variables, allowing the analysis to focus on the most impactful features. This step not only streamlined the dataset but also enhanced its interpretability, ensuring a more targeted exploration of key patterns.

The retained variables were selected based on their strong correlations with the target variable and their potential to reveal underlying trends. The EDA process systematically examined these variables using various visualizations to uncover distributions, trends, and potential anomalies in the data.

In the initial stage, the analysis focused on Age, Height, and Weight, the variables identified as most relevant through the refined heatmap. These features were prioritized due to their strong relationships with the target outcomes and their foundational role in understanding key trends within the dataset.

3.1 Age

Code
age_summary <- summary(dataset$age)
age_sd <- sd(dataset$age, na.rm = TRUE)
sum_age_df <- tibble::tibble(
  Metric = c(names(age_summary), "Standard Deviation"),
  Value = round(c(age_summary, age_sd), 2)
)
kable(sum_age_df, format = "markdown", caption = "Age Variable Statistics")
Age Variable Statistics
Metric Value
Min. 14.00
1st Qu. 19.92
Median 22.85
Mean 24.35
3rd Qu. 26.00
Max. 61.00
Standard Deviation 6.37

The age variable exhibits a right-skewed distribution: the mean of 24.35 years exceeds the median of 22.85 years, reflecting a concentration of younger individuals and a long tail toward older ages. The range spans from 14 to 61 years, though the middle half of the sample lies between roughly 20 and 26 years. A standard deviation of 6.37 years reflects moderate variability in age across the dataset. This predominantly young sample may introduce limitations when generalizing findings to older populations, where obesity-related factors might differ significantly.
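The direction of skew can also be checked numerically rather than read off the mean and median. The following sketch computes the moment-based sample skewness on simulated right-skewed ages; `age_sim` and the helper `skewness()` are hypothetical and not part of the original analysis.

```r
set.seed(42)
# Simulated right-skewed ages (hypothetical; not the study data)
age_sim <- 14 + rgamma(1000, shape = 2, scale = 5)

# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2
skewness <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

c(mean = mean(age_sim), median = median(age_sim), skewness = skewness(age_sim))
```

A positive skewness together with a mean above the median corresponds to the right-skewed pattern described for age.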

Code
g5 <- ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot5 <- ggplotly(g5)
plot5

A violin and boxplot combination was designed to visualize the age distribution across various obesity levels, offering a detailed perspective on patterns and trends within the data. This visualization highlights that individuals with insufficient or normal weight are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels, such as Obesity Type I and Type II, display a broader age range, peaking around 30–40 years. Severe obesity (Type III) is rare among younger individuals but becomes more prevalent in mid-adulthood. This visualization emphasizes the gradual increase in obesity risk with age, underlining the importance of early intervention, particularly during early and mid-adulthood, when such risks are most pronounced.

Code
g6 <- ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()
plot6 <- ggplotly(g6)
plot6

Complementing this, the trend line graph further captures the trajectory of obesity levels with age. A sharp rise in obesity is observed from adolescence to early adulthood, peaking in the 25–30 years range. This critical transition phase is likely influenced by lifestyle factors such as reduced physical activity, increased caloric intake, and metabolic changes. After this peak, the trend reveals a gradual decline in obesity levels beyond 30 years, potentially reflecting improved health awareness, dietary adjustments, or a selection bias in older populations. These insights underscore the mid-20s to early-30s as a pivotal stage for targeted obesity prevention and intervention strategies.

3.2 Height

Code
height_summary <- summary(dataset$height)
height_sd <- sd(dataset$height, na.rm = TRUE)
sum_height_df <- tibble::tibble(
  Metric = c(names(height_summary), "Standard Deviation"),
  Value = round(c(height_summary, height_sd), 2)
)
kable(sum_height_df, format = "markdown", caption = "Height Variable Statistics")
Height Variable Statistics
Metric Value
Min. 1.45
1st Qu. 1.63
Median 1.70
Mean 1.70
3rd Qu. 1.77
Max. 1.98
Standard Deviation 0.09
Code
g7 <- ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()
plot7 <- ggplotly(g7)
plot7

A histogram was generated to visualize the height distribution, revealing an approximately normal shape with a slight right skew. Values range between 1.45m and 1.98m, peaking around 1.75m, which represents the most common height. Both the mean and median are 1.70m, confirming a nearly symmetrical distribution. The standard deviation of 0.09m indicates low variability, and no extreme outliers are observed, highlighting a realistic and consistent dataset for height.

Code
g8 <- ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
plot8 <- ggplotly(g8)
plot8

The violin plot further explores the height distribution across obesity levels. Each category exhibits relatively low variability, with overlapping ranges across groups. Insufficient and Normal Weight categories have slightly narrower distributions, centered around 1.7m. As obesity levels increase, from Obesity Type I to Type III, the distributions remain consistent, indicating that height does not significantly vary with obesity classification. These findings suggest that while height remains a stable feature, weight likely plays a more decisive role in determining obesity levels.

3.3 Weight

Code
g9 <- ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
plot9 <- ggplotly(g9)
plot9

The density plot shows distinct differences in weight distribution between genders. Females generally exhibit lower weights, with a peak around 70 units, whereas males show peaks at 85 and 115 units, reflecting a tendency toward higher weights. An overlapping region between 80 and 90 units indicates common weight ranges for both genders, though the distinct peaks underscore gender-based differences.

Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")

Building on these insights, the ridgeline plot further illustrates the relationship between weight and obesity levels. As obesity levels increase, the weight distribution shifts consistently toward higher values. Categories such as “Insufficient Weight” and “Normal Weight” cluster at lower ranges, while higher obesity types (I, II, and III) peak at significantly greater weights. This clear progression confirms a strong positive association between weight and obesity levels, reinforcing the centrality of weight in obesity classification. The dataset’s average weight is 86.6 units with a standard deviation of 26.6, capturing the variability across different obesity categories.

3.4 Height and Weight

To explore the relationship between height and weight more thoroughly, scatter plots were utilized to analyze trends across obesity levels, offering valuable insights that align with theoretical expectations and highlight weight variations within height ranges for different obesity classifications.

Code
g11 <- ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +  # Adds a trend line for each obesity level
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")
plot11 <- ggplotly(g11)
plot11
Code
g12 <- ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")
plot12 <- ggplotly(g12)
plot12

Expanding on this analysis, the first scatter plot provides a broad overview of the positive trend between height and weight across obesity levels. While this visualization captures the general relationship, the overlapping data points obscure category-specific details, making it difficult to differentiate unique patterns within each obesity group.

To address this limitation, the data was split into a faceted plot that isolates each obesity category, offering a more nuanced perspective on weight-height trends. Viewed category by category, the relationships differ notably in slope: higher obesity levels, such as Obesity Type II and III, display steeper slopes, meaning that weight rises faster with height, while lower categories like Insufficient Weight and Normal Weight exhibit flatter slopes, indicating a smaller weight increase per unit of height. This detailed examination underscores the variability in the height-weight relationship across obesity levels, providing a clearer understanding of how these factors interact within distinct classifications.

Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468

The observed correlation between height and weight (r = 0.457) aligns with findings in existing literature, confirming a moderate positive relationship and reinforcing the expectation that taller individuals generally weigh more, though the strength of this association varies slightly across obesity levels.
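Slope and correlation measure different things: the slope is the expected change in weight per unit of height, whereas the correlation reflects how tightly the points follow the line. The sketch below illustrates this on simulated data with two hypothetical groups, "A" and "B"; it is not computed from the study's dataset.

```r
set.seed(7)
n <- 200
# Two hypothetical groups with the same height distribution
sim <- data.frame(
  group  = rep(c("A", "B"), each = n),
  height = rnorm(2 * n, 1.70, 0.09)
)
sim$weight <- ifelse(sim$group == "A",
                     40 * sim$height + rnorm(2 * n, 0, 2),    # shallow slope, little noise
                     90 * sim$height + rnorm(2 * n, 0, 15))   # steep slope, much noise

# Fitted slope and Pearson correlation within each group
by_group <- do.call(rbind, lapply(split(sim, sim$group), function(g) {
  data.frame(
    group = g$group[1],
    slope = unname(coef(lm(weight ~ height, data = g))["height"]),
    r     = cor(g$height, g$weight)
  )
}))
by_group
```

In this simulation group B has the steeper slope but the weaker correlation, which is why per-category slopes and per-category correlations are best reported separately.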

With the analysis of age, height, and weight completed, attention shifts to exploring the remaining variables in the dataset. These variables, while less directly correlated with the target outcomes, offer critical insights into behavioral, lifestyle, and environmental factors that may influence obesity levels.

3.5 Food between meals

Code
g13 <- ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
   labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14))

plot13 <- ggplotly(g13)
plot13
Code
g14 <- ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
    geom_bar(position = "fill") + # Stacked bar chart with proportions
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
    ggtitle("Proportion of Food Between Meals Across Obesity Levels") + # Shortened and clear title
    labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") + # Correct axis and legend labels
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis text for readability
        plot.title = element_text(hjust = 0.5, size = 14) # Center and style the title
    )
plot14 <- ggplotly(g14)
plot14

Together, these visualizations were designed to analyze the frequency of eating between meals, providing insights into both total prevalence and shifting trends across obesity categories. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are associated with habitual moderate snacking rather than with very frequent snacking. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.

3.6 High-caloric food consumption

Code
g16 <- ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  geom_bar(
    position = "dodge",
    color = "black"
  ) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Count", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )
plot16 <- ggplotly(g16)
plot16

To investigate the consumption of high-caloric food, a grouped bar chart was used. It highlights a clear trend of increased high-caloric food consumption as obesity levels rise. High-caloric food consumption (“yes”) is the dominant behavior across all categories, with the highest counts observed in higher obesity levels (Obesity Type I–III). In contrast, “no” responses are more visible in lower categories such as Insufficient Weight and Normal Weight, though they remain significantly lower in count compared to “yes” responses. This pattern underscores the strong association between high-caloric food intake and obesity progression.

Code
percentage_high_caloric_consumers <- mean(dataset$caloric_food == "yes") * 100
percentage_high_caloric_consumers
[1] 88.35649

More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, a behavior strongly associated with weight gain. This underscores the critical importance of dietary interventions aimed at reducing high-calorie intake to address obesity progression effectively.

3.7 Alcohol consumption

Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

g17 <- ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
   labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

plot17 <- ggplotly(g17)
plot17

Regarding alcohol consumption, the chart chosen excludes “Always” responses due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. Instead, the data reveals that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.

Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
g18 <- ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) + # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and style title
    legend.position = "top"
  )
plot18 <- ggplotly(g18)
plot18

To better illustrate the trends in alcohol consumption frequency across obesity levels, this graph was created to highlight the shifting proportions of individuals consuming alcohol “Sometimes” and abstaining (“No”). The proportion of individuals who drink alcohol “Sometimes” shows a steady increase with higher obesity levels, peaking in Obesity_Type_III. Conversely, the proportion of those who abstain from alcohol decreases as obesity levels rise, suggesting an inverse relationship between abstention and obesity severity.

This pattern raises questions about the potential interaction between alcohol consumption frequency and caloric food preferences, as both behaviors appear to be associated with higher obesity levels. Investigating this interaction could provide insights into whether a combination of moderate alcohol consumption and high-calorie food preferences exerts a compounded effect on obesity risk. Understanding these combined lifestyle factors could inform strategies aimed at mitigating obesity progression more effectively.
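One way such a combined effect could be examined is by adding an interaction term to a logistic regression and comparing it with the purely additive model. The sketch below uses simulated data with a deliberately built-in interaction; the binary coding of `alcohol` and `caloric_food`, the coefficients, and the outcome are all assumptions made for illustration, not results from this study.

```r
set.seed(99)
n <- 2000
# Simulated data (hypothetical coding, not the study data)
sim <- data.frame(
  alcohol      = rbinom(n, 1, 0.6),   # 1 = drinks at least sometimes
  caloric_food = rbinom(n, 1, 0.85)   # 1 = frequent high-calorie food
)
# Outcome simulated with a built-in interaction, purely for illustration
lin <- -1.5 + 0.3 * sim$alcohol + 0.8 * sim$caloric_food +
  0.7 * sim$alcohol * sim$caloric_food
sim$obese <- rbinom(n, 1, plogis(lin))

additive   <- glm(obese ~ alcohol + caloric_food, data = sim, family = binomial)
with_inter <- glm(obese ~ alcohol * caloric_food, data = sim, family = binomial)

# Likelihood-ratio test: does the interaction term improve the fit?
anova(additive, with_inter, test = "Chisq")

# Odds ratios; the alcohol:caloric_food entry is the multiplicative interaction
exp(coef(with_inter))
```

An interaction odds ratio above 1 would indicate that the two behaviors together are associated with higher odds of obesity than their separate effects would imply.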

3.8 Daily Calorie Monitoring

Code
# Dodged Bar Chart for calorie_check by Obesity Levels
g19 <- ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart of Calorie Checking by Obesity Level") +
   labs(x = "Check of calories", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title
plot19 <- ggplotly(g19)
plot19
Code
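data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  # Compute proportions within each obesity level, not over the whole sample
  group_by(obesity_lev) %>%
  mutate(total = sum(count), proportion = count / total) %>%
  ungroup()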

# Proportion of Calorie Checking by Obesity Level
g20 <- ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
plot20 <- ggplotly(g20)
plot20

For the variable calorie_check, a Dodged Bar Chart was rendered, highlighting two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight.

The proportion graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.
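When obesity levels differ in size, proportions are only comparable across levels if they are computed within each level. The sketch below contrasts the two calculations on a small table of hypothetical counts (not the study data).

```r
# Hypothetical counts: rows are obesity levels, columns are calorie_check answers
counts <- matrix(c(90, 10,
                   45,  5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(level = c("Normal", "Obese"),
                                 calorie_check = c("no", "yes")))

# Share of the whole sample: depends on how many people each level contains
prop.table(counts)

# Share within each obesity level: every row sums to 1
within_level <- prop.table(counts, margin = 1)
within_level
```

In this example both levels have the same within-level share of "yes" answers (10%), even though their shares of the whole sample differ by a factor of two.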

3.9 Vegetable consumption

Code
g21 <- ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y =after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")
plot21 <- ggplotly(g21)
plot21
Code
g22 <- ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", 
         x = "Weight", 
         y = "Vegetable Food Consumption") +
    theme_minimal() +
    coord_cartesian(xlim= c(40, 135), ylim= c(2, 3))
plot22 <- ggplotly(g22)
plot22

To explore the relationship between vegetable food consumption and weight, two visualizations were employed. A histogram combined with a density plot was used to examine the overall distribution of vegetable consumption across the dataset. This provided insights into the frequency and variability of vegetable intake levels, with distinct peaks indicating common consumption behaviors. Following this, a scatterplot was generated with a trend line illustrating a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.

This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.

The trend reveals two possible key insights:

  • A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
  • The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.

3.10 Physical activity

To explore how physical activity varies across obesity levels, two complementary visualizations were utilized.

Code
g23 <- ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +  
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
    ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot23 <- ggplotly(g23)
plot23
Code
# Calculate the average physical activity per obesity level
average_activity <- dataset %>%
  group_by(obesity_lev) %>%
  summarise(mean_activity = mean(physical_act))

# Plot the average physical activity
g24 <- ggplot(average_activity, aes(x = obesity_lev, y = mean_activity, group = 1)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(size = 3, color = "darkblue") +
  ggtitle("Average Physical Activity Across Obesity Levels") +
  labs(x = "Obesity Level", y = "Average Physical Activity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 14))
plot24 <- ggplotly(g24)
plot24

The line chart emphasizes the consistent decline in average physical activity as obesity levels increase. Starting from Insufficient and Normal Weight categories, the average activity levels gradually drop, stabilizing briefly at Obesity Types I and II before sharply decreasing in Obesity Type III. This trend underscores a clear negative association between physical activity and obesity levels, highlighting the potential role of physical inactivity as a contributing factor to severe obesity. Combined, these visualizations illustrate the importance of promoting physical activity as a preventive and intervention strategy across all weight categories, particularly for individuals at higher obesity levels.

3.11 Water consumption

Code
g25 <- ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Water Consumption") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")
plot25 <- ggplotly(g25)
plot25

A histogram with a density overlay was generated for the distribution of daily water consumption (CH2O). It shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.

Code
g26 <- ggplot(dataset, aes(x = weight, y = ch2o)) +
    geom_point(alpha = 0.5, color = "lightgray") +  # Scatterplot in the background with reduced opacity
    geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1.2) +  # Trend line in the foreground
    labs(title = "Scatterplot with Trend Line: Water Consumption vs Weight", 
         x = "Weight", y = "Water Consumption (ch2o)") +
    theme_minimal()
plot26 <- ggplotly(g26)
plot26

To explore the connection further, a scatterplot with a trend line was created to analyze how water intake varies with weight. The trend line indicates a slight upward pattern, suggesting that water consumption increases moderately as weight rises. Individuals in lower weight categories, such as Insufficient and Normal Weight, show slightly lower water consumption compared to those in higher weight groups, including Obesity Type II and III. This increase in consumption among heavier individuals could reflect greater hydration needs or attempts to implement healthier habits. Despite this trend, water intake appears relatively stable across most weight ranges, highlighting an opportunity to encourage better hydration practices as part of a comprehensive health approach.

3.12 Technology utilization

Code
g28 <- ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()
plot28 <- ggplotly(g28)
plot28

This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.

The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.

This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.

The Exploratory Data Analysis (EDA) phase provided a comprehensive understanding of the dataset, offering key insights into the relationships between behavioral, lifestyle, and demographic factors and obesity levels. By focusing on critical variables, the EDA revealed patterns and trends that inform the modeling process.

4 Analysis

The analysis phase covers the development, refinement, and evaluation of the predictive models designed to address the previously defined research questions.

4.1 Methods

The modeling process is structured to address the two key research questions:

  1. identifying the most significant lifestyle and behavioral factors contributing to obesity in Mexico, Peru, and Colombia;

  2. predicting whether a person will be obese based on some given combinations of factors.

4.1.1 Logistic Regression Model

To address the key research questions, a logistic regression model is used to estimate the probability that an individual belongs to one of two categories: obese or not obese. Weight and height are excluded as predictors because they are used directly to calculate BMI, which is the basis for the obesity levels categorized in the dataset. Including them would allow the model to reconstruct the target variable almost exactly, masking the contribution of every other predictor. Excluding weight and height shifts the focus to behavioral and lifestyle factors, such as dietary habits, physical activity, and mode of transportation, to better understand their influence on obesity risk.

While logistic regression provides a clear and interpretable framework for estimating probabilities, the binary form used here limits the analysis to a two-class outcome. This restriction prevents the exploration of the full spectrum of obesity levels, such as Obesity Type I, II, or III, as classified in the dataset. Despite this limitation, logistic regression is a robust method for quantifying the relationships between independent variables and the binary outcome. Feature selection is used to retain the most relevant predictors, and the model’s performance is evaluated using accuracy, precision, recall, F1-score, and ROC-AUC.
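
For reference, the standard form of the binary logistic model described above can be written as follows. The notation is introduced here for illustration and is not taken from the report: $p_i$ is the probability that individual $i$ is obese, $x_{ij}$ is the value of predictor $j$ for that individual, and $\beta_j$ are the coefficients to be estimated.

```latex
\[
\log\frac{p_i}{1-p_i} \;=\; \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad
p_i \;=\; \Pr(\text{Obesity}_i = 1 \mid x_i)
\;=\; \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\right)}} .
\]
```

Each coefficient is therefore an additive effect on the log-odds scale, and $e^{\beta_j}$ is the corresponding odds ratio.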

4.1.2 Insights and Limitations

Regression analysis helps us understand how predictors influence outcomes, with logistic regression classifying individuals as obese or not obese. As discussed in the previous sections, the dataset offers a mix of advantages and challenges: synthetic data ensures balanced representation but lacks the complexity of real-world patterns, while user-collected data adds variability but is prone to biases. Logistic regression simplifies the analysis by focusing on a binary outcome, leaving out the nuanced gradations of obesity, and assumes that each predictor is linearly related to the log-odds of obesity, which may not fully capture complex relationships. Despite these limitations, the model offers insights into obesity risk, serving as a valuable exercise and foundation for future data explorations, even if not directly applicable to real-world scenarios.

4.2 Objectives of the Selected Method

4.2.1 Logistic Regression Model Development

Data Loading and Processing

To align with the requirements of a binary logistic regression model, it was necessary to modify the dataset’s target variable. The original variable, obesity level, was a multi-class categorical variable representing varying degrees of obesity and non-obesity. Since the model requires a two-class outcome, BMI was computed as weight in kilograms divided by the square of height in meters, and individuals with a BMI ≥ 30 were classified as obese (1), while all others were classified as non-obese (0). Following this adjustment, the target variable was converted into a factor, and the dataset was reviewed for consistency and readiness for analysis.

Code
dataset <- read.csv(here("data/processed/dataset.csv"))
dataset$BMI <- dataset$weight / (dataset$height^2)
head(dataset$BMI) %>%
  tibble::enframe(name = "Row", value = "BMI") %>%
  kable(format = "html", caption = "First 6 BMI Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5")
First 6 BMI Values
Row BMI
1 24.38653
2 24.23823
3 23.76543
4 26.85185
5 28.34238
6 20.19509
Code
summary_BMI <- summary(dataset$BMI)
summary_BMI %>%
  tibble::enframe(name = "Statistic", value = "Value") %>%
  kable(format = "html", caption = "Summary Statistics for BMI") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5")
Summary Statistics for BMI
Statistic Value
Min. 13.00
1st Qu. 24.37
Median 28.90
Mean 29.77
3rd Qu. 36.10
Max. 50.81
Code
dataset$Obesity <- ifelse(dataset$BMI >= 30, 1, 0)
dataset$Obesity <- as.factor(dataset$Obesity)
table(dataset$Obesity) %>%
  as.data.frame() %>%
  kable(format = "html", caption = "Frequency: obesity categories") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5")
Frequency: obesity categories
Var1 Freq
0 1113
1 974
Code
head(dataset) %>%
  kable(format = "html", caption = "First 6 Rows of Updated Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
First 6 Rows of Updated Dataset
gender age height weight family_hist caloric_food vegetable_food nb_meal_day food_btw_meals smoke ch2o calorie_check physical_act use_tech freq_alcohol m_trans obesity_lev BMI Obesity
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 No Public_Transportation Normal_Weight 24.38653 0
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight 24.23823 0
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight 23.76543 0
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I 26.85185 0
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II 28.34238 0
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight 20.19509 0

Twelve predictors associated with obesity-related behaviors, dietary habits, physical activity, and lifestyle factors were selected for analysis. These variables, along with the binary target variable Obesity (1 = obese, 0 = not obese), formed the dataset for logistic regression modeling. The dataset was reviewed to ensure correct structure and readiness for analysis.

Code
predictors <- c("family_hist", "caloric_food", "vegetable_food", "nb_meal_day", "food_btw_meals", "smoke", "ch2o", "calorie_check", "physical_act", "use_tech", "freq_alcohol", "m_trans")
model_data <- dataset[, c("Obesity", predictors)]

str_output <- capture.output(str(model_data))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

str_table %>%
  kable(format = "html", caption = "Structure of Model Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Structure of Model Data
Structure
'data.frame': 2087 obs. of 13 variables:
$ Obesity : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ family_hist : chr "yes" "yes" "yes" "no" ...
$ caloric_food : chr "no" "no" "no" "no" ...
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 ...
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 ...
$ food_btw_meals: chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
$ smoke : chr "no" "yes" "no" "no" ...
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 ...
$ calorie_check : chr "no" "yes" "no" "no" ...
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 ...
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 ...
$ freq_alcohol : chr "No" "Sometimes" "Frequently" "Frequently" ...
$ m_trans : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
Code
head(model_data) %>%
  kable(format = "html", caption = "First 6 Rows of Model Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
First 6 Rows of Model Data
Obesity family_hist caloric_food vegetable_food nb_meal_day food_btw_meals smoke ch2o calorie_check physical_act use_tech freq_alcohol m_trans
0 yes no 2 3 Sometimes no 2 no 0 1 No Public_Transportation
0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking
0 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation
0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile

Model Development

Three regression models were employed to ensure a systematic and robust approach to predictor selection and model development. The null model, containing only the intercept, served as a baseline to represent predictions without the influence of any predictors. This provided a reference point to evaluate how much additional explanatory power was gained by including predictors.

The full model, incorporating all predictors, represented the maximum complexity allowable within the dataset. This model helped understand the potential contribution of each variable but carried the risk of overfitting due to its complexity.

The stepwise model, guided by the Akaike Information Criterion (AIC), balances model fit against complexity. Starting from the null model, the procedure iteratively adds or removes predictors, keeping a change only when it lowers the AIC, and stops when no single addition or removal improves it further. The result is a more parsimonious model than the full one, although AIC-based selection does not guarantee that every retained predictor is statistically significant, and generalizability is not assessed because no hold-out data are used. Using these three models allowed for a thorough comparison and the development of a parsimonious predictive model.
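
The trade-off that AIC encodes can be illustrated on simulated data. The sketch below is not the study's analysis: the data, the variable names x_signal and x_noise, and the helper aic_by_hand are all assumptions made for illustration. It fits an intercept-only and a two-predictor logistic model, checks the definition AIC = 2k - 2 log-likelihood against R's built-in value, and runs a bidirectional stepwise search from the null model.

```r
# Illustrative only: simulated data, not the study's dataset.
set.seed(42)
n <- 500
x_signal <- rnorm(n)  # predictor that truly affects the outcome
x_noise  <- rnorm(n)  # predictor unrelated to the outcome
y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.2 * x_signal))
sim <- data.frame(y, x_signal, x_noise)

null_fit <- glm(y ~ 1, data = sim, family = binomial)
full_fit <- glm(y ~ x_signal + x_noise, data = sim, family = binomial)

# AIC = 2k - 2 * log-likelihood, where k is the number of estimated coefficients
aic_by_hand <- function(fit) {
  2 * length(coef(fit)) - 2 * as.numeric(logLik(fit))
}

# Stepwise search in both directions, starting from the intercept-only model
step_fit <- step(
  null_fit,
  scope = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "both",
  trace = FALSE
)

print(c(null = AIC(null_fit), full = AIC(full_fit), step = AIC(step_fit)))
print(formula(step_fit))
```

Because each extra coefficient adds 2 to the AIC, a predictor is kept only if it improves the log-likelihood by more than one unit, which is a weaker requirement than significance at the 5% level.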

Code
full_model <- glm(Obesity ~ ., data = model_data, family = binomial)
null_model <- glm(Obesity ~ 1, data = model_data, family = binomial)
stepwise_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "both", trace = FALSE)

Presented below is a comprehensive overview of the logistic regression models.

Null Model

Code
null_model_summary <- summary(null_model)

null_model_coef_table <- coef(null_model_summary) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Predictor")

null_model_coef_table %>%
  kable(format = "html", caption = "Coefficients of the Null Logistic Regression Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Coefficients of the Null Logistic Regression Model
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.133403 0.0438767 -3.040409 0.0023626

Full Model

Code
coef_table <- coef(summary(full_model)) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Predictor")

coef_table %>%
  kable(format = "html", caption = "Coefficients of the Full Logistic Regression Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Coefficients of the Full Logistic Regression Model
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -15.3708461 324.7452987 -0.0473320 0.9622486
family_histyes 3.6805286 0.3746264 9.8245297 0.0000000
caloric_foodyes 2.0725408 0.2581442 8.0286177 0.0000000
vegetable_food 0.8746010 0.1124784 7.7757248 0.0000000
nb_meal_day 0.0372577 0.0776059 0.4800884 0.6311646
food_btw_mealsFrequently -2.0413821 0.5923659 -3.4461506 0.0005686
food_btw_mealsNo -0.7121231 0.9055177 -0.7864264 0.4316177
food_btw_mealsSometimes 1.2734295 0.4581389 2.7795707 0.0054431
smokeyes 1.0493108 0.4608027 2.2771367 0.0227781
ch2o 0.1627569 0.1006291 1.6173944 0.1057932
calorie_checkyes -2.6591572 0.6318897 -4.2082618 0.0000257
physical_act -0.3277453 0.0718680 -4.5603756 0.0000051
use_tech -0.3930575 0.0966253 -4.0678543 0.0000474
freq_alcoholFrequently 5.9574819 324.7447982 0.0183451 0.9853635
freq_alcoholNo 6.5421132 324.7446161 0.0201454 0.9839274
freq_alcoholSometimes 6.7010533 324.7446386 0.0206348 0.9835369
m_transBike 0.2810325 1.4324967 0.1961837 0.8444664
m_transMotorbike 1.6026737 0.9376006 1.7093351 0.0873889
m_transPublic_Transportation 0.5713577 0.1296424 4.4071834 0.0000105
m_transWalking -1.9050177 0.6515824 -2.9236788 0.0034592

Stepwise Model

Code
stepwise_summary <- summary(stepwise_model)

stepwise_coef_table <- coef(stepwise_summary) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Predictor")

stepwise_coef_table %>%
  kable(format = "html", caption = "Coefficients of the Stepwise Logistic Regression Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Coefficients of the Stepwise Logistic Regression Model
Predictor Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.7956227 0.7141792 -12.3157087 0.0000000
family_histyes 3.6796264 0.3746724 9.8209158 0.0000000
food_btw_mealsFrequently -2.0244197 0.5873593 -3.4466462 0.0005676
food_btw_mealsNo -0.6183034 0.8999773 -0.6870211 0.4920694
food_btw_mealsSometimes 1.3604530 0.4559138 2.9840138 0.0028449
caloric_foodyes 2.1168040 0.2566262 8.2485893 0.0000000
vegetable_food 0.8886854 0.1115908 7.9637892 0.0000000
m_transBike 0.3743789 1.4395954 0.2600584 0.7948187
m_transMotorbike 1.6429866 0.9411341 1.7457517 0.0808541
m_transPublic_Transportation 0.6015745 0.1282129 4.6919960 0.0000027
m_transWalking -1.8531534 0.6503271 -2.8495711 0.0043778
calorie_checkyes -2.6730603 0.6307480 -4.2379214 0.0000226
physical_act -0.3462794 0.0705278 -4.9098293 0.0000009
use_tech -0.4174300 0.0956562 -4.3638577 0.0000128
smokeyes 1.0211075 0.4549016 2.2446777 0.0247888
ch2o 0.1679017 0.0994235 1.6887520 0.0912670

Evaluation

To evaluate the stepwise-selected model, predicted probabilities of obesity were generated for all individuals in the dataset used to fit the model. These probabilities were converted into binary classifications using a threshold of 0.5, so that individuals with a predicted probability of at least 0.5 were classified as obese. A confusion matrix was constructed to assess the model’s performance, providing key metrics such as accuracy, sensitivity, specificity, precision, and F1-score.

Code
library(caret)
library(pROC)
# Predict probabilities from the stepwise model
predicted_probs <- predict(stepwise_model, type = "response")

# Convert probabilities to binary classes (using a threshold of 0.5)
predicted_classes <- ifelse(predicted_probs >= 0.5, 1, 0)

# Create a confusion matrix (predicted classes first, actual classes second)
conf_matrix <- confusionMatrix(as.factor(predicted_classes), model_data$Obesity)

# caret stores the table as Prediction x Reference, so the first column holds the predictions
conf_matrix_table <- as.data.frame(conf_matrix$table)
colnames(conf_matrix_table) <- c("Predicted", "Actual", "Count")
# Percentage = share of each predicted class that falls in each actual class
conf_matrix_table <- conf_matrix_table %>%
  group_by(Predicted) %>%
  mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%
  ungroup()

conf_matrix_table %>%
  kable(format = "html", caption = "Confusion Matrix: Predicted vs Actual Values with Percentages") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  column_spec(1, bold = TRUE, width = "150px") %>%
  column_spec(2, width = "150px") %>%
  column_spec(3, width = "100px") %>%
  column_spec(4, width = "100px") %>%
  scroll_box(width = "100%", height = "400px")
Confusion Matrix: Predicted vs Actual Values with Percentages
Predicted Actual Count Percentage
0 0 749 84.35
1 0 364 30.36
0 1 139 15.65
1 1 835 69.64
Code
performance_metrics_vertical <- as.data.frame(conf_matrix$overall) %>%
  tibble::rownames_to_column("Metric") %>%
  tidyr::pivot_longer(cols = -Metric, names_to = NULL, values_to = "Value")
performance_metrics_vertical %>%
  kable(format = "html", caption = "Confusion Matrix Overall Performance Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Confusion Matrix Overall Performance Metrics
Metric Value
Accuracy 0.7589842
Kappa 0.5227055
AccuracyLower 0.7400388
AccuracyUpper 0.7771994
AccuracyNull 0.5333014
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
Code
class_metrics_vertical <- as.data.frame(conf_matrix$byClass) %>%
  tibble::rownames_to_column("Metric") %>%
  tidyr::pivot_longer(cols = -Metric, names_to = NULL, values_to = "Value")
class_metrics_vertical %>%
  kable(format = "html", caption = "Class-Level Performance Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f5f5f5") %>%
  scroll_box(width = "100%", height = "400px")
Class-Level Performance Metrics
Metric Value
Sensitivity 0.6729560
Specificity 0.8572895
Pos Pred Value 0.8434685
Neg Pred Value 0.6964137
Precision 0.8434685
Recall 0.6729560
F1 0.7486257
Prevalence 0.5333014
Detection Rate 0.3588884
Detection Prevalence 0.4254911
Balanced Accuracy 0.7651228
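
The class-level figures can be reproduced by hand from the four counts in the confusion matrix. The sketch below is illustrative: it assumes that 749 and 835 are the correctly classified non-obese and obese individuals, which is consistent with the class totals of 1113 and 974 in the frequency table, and class_metrics is a hypothetical helper rather than a caret function. It computes the metrics once with class 0 (not obese) as the positive class, which reproduces the values tabulated above, and once with class 1 (obese) as the positive class.

```r
# Counts from the confusion matrix (rows = predicted class, columns = actual class)
cm <- matrix(
  c(749, 364, 139, 835),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Hypothetical helper: class-level metrics for a chosen positive class
class_metrics <- function(cm, positive) {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]  # predicted positive, actually positive
  fn <- cm[negative, positive]  # predicted negative, actually positive
  fp <- cm[positive, negative]  # predicted positive, actually negative
  tn <- cm[negative, negative]  # predicted negative, actually negative
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(
    sensitivity = recall,
    specificity = tn / (tn + fp),
    precision = precision,
    f1 = 2 * precision * recall / (precision + recall)
  )
}

metrics_class0 <- class_metrics(cm, positive = "0")  # matches the table above
metrics_class1 <- class_metrics(cm, positive = "1")  # obese treated as positive
accuracy <- sum(diag(cm)) / sum(cm)

print(round(rbind(class0 = metrics_class0, class1 = metrics_class1), 4))
print(round(accuracy, 4))
```

With two factor levels, confusionMatrix uses the first level as the positive class unless its positive argument is set, which is why the tabulated sensitivity corresponds to class 0.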

Additionally, the ROC curve and AUC were calculated to further evaluate the model’s discriminative ability. The ROC curve visualizes the trade-off between sensitivity and specificity, while the AUC quantifies the model’s ability to distinguish between obese and non-obese individuals.

Code
# Compute ROC curve using the actual class labels and predicted probabilities
roc_curve <- pROC::roc(model_data$Obesity, predicted_probs)
auc_value <- pROC::auc(roc_curve)
print(auc_value)
Area under the curve: 0.8556
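
The AUC has a direct interpretation: it is the proportion of (obese, non-obese) pairs in which the obese individual receives the higher predicted probability. The sketch below illustrates this on simulated scores, not on the study's predictions; the variable names and the simulation settings are assumptions. It computes the AUC by comparing all pairs and again through the equivalent rank-sum formula.

```r
# Illustrative only: simulated labels and scores, not the study's predictions.
set.seed(1)
labels <- rbinom(200, size = 1, prob = 0.5)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, 0)))

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# AUC as the share of (positive, negative) pairs ranked correctly; ties count one half
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Equivalent rank-sum (Mann-Whitney) form
r <- rank(c(pos, neg))
auc_rank <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairs = auc_pairs, rank = auc_rank))
```

Under this reading, the reported value of 0.8556 means that the model ranks a randomly chosen obese individual above a randomly chosen non-obese individual in roughly 86% of pairs.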

Results Visualization

To assess the distribution of predicted probabilities, a scatter plot was created with observations color-coded by their actual class. This visualization provides a clear overview of the model’s predictions and potential misclassifications.

Code
plot(predicted_probs, col = ifelse(model_data$Obesity == 1, "blue", "red"), pch = 16, xlab = "n° Observation", ylab = "Predicted Probability", main = "Predicted Probabilities of Obesity", cex = 0.6)
legend("bottomright", legend = c("Obese", "Not Obese"), col = c("blue", "red"), pch = 16)

For additional clarity, the ROC curve was plotted to visually represent the model’s performance.

Code
# legacy.axes = TRUE puts 1 - specificity (the false positive rate) on an increasing x-axis
plot(roc_curve, legacy.axes = TRUE, col = "blue", main = "ROC Curve", lwd = 3, ylim = c(0, 1.05), xlab = "False Positive Rate", ylab = "True Positive Rate", cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.1)
legend("topright", legend = paste("AUC =", round(auc_value, 3)), lwd = 0, cex = 1.2, bty = "n")
grid()

Predicting Obesity Probability

To illustrate how the model assigns obesity probabilities, six hypothetical profiles were created, representing a diverse range of lifestyles. Each profile was designed to highlight specific behavioral, dietary, and lifestyle patterns.

The first individual represents a high-risk case for obesity. This person has a family history of being overweight, frequently consumes high-calorie foods and snacks, and eats very few vegetables. They have five meals a day, drink only 0.5 liters of water daily, and do no physical activity. Additionally, they spend 10 hours a day using technology, consume alcohol consistently, and rely primarily on public transportation for mobility.

The second individual exemplifies a very healthy lifestyle. They have no family history of being overweight, rarely consume high-calorie foods or snacks, and eat a large amount of vegetables. Their diet consists of very few meals per day, complemented by a high water intake of 4 liters daily. They do not monitor calorie intake but engage in physical activity five times a week. They walk as their primary mode of transportation, do not consume alcohol, and spend only 0.5 hours daily using technology.

The third individual exhibits a balanced lifestyle but shows some risk factors. This person has a family history of being overweight, frequently consumes snacks and high-calorie foods, and eats a moderate amount of vegetables. They have three meals a day, drink 1 liter of water, and monitor their calorie intake. However, they engage in physical activity only once a week, use technology for 8 hours daily, use a motorbike for transportation, and occasionally consume alcohol.

The fourth individual is physically active and health-conscious. They have no family history of being overweight, do not frequently consume high-calorie foods, but snack occasionally. They eat a lot of vegetables, have a small number of meals per day, and drink 3 liters of water daily. They do not monitor calorie intake but exercise three times a week and use a bicycle for transportation. They consume alcohol frequently but spend only an hour daily using technology.

The fifth individual represents another high-risk case due to a sedentary lifestyle. They have a family history of being overweight, frequently consume high-calorie foods and snacks, and eat very few vegetables. They have four meals a day, drink 2 liters of water, and do no physical activity. They spend 6 hours daily using technology, consume alcohol moderately, and rely on public transportation.

The sixth individual leads a very active lifestyle but has some risk factors related to snacking and transportation choices. They have no family history of being overweight and do not frequently consume high-calorie foods, although they always snack between meals, and they eat a large amount of vegetables. They have two meals per day, drink 1.5 liters of water, and do not monitor calorie intake. They engage in physical activity four times a week, use a motorbike for transportation, do not consume alcohol, and spend 2 hours daily using technology.

These six profiles were designed to probe how the model responds to a wide variety of scenarios and to show how predicted obesity probabilities change across different combinations of risk factors.

Code
new_data <- data.frame(
  family_hist = factor(c("yes", "no", "yes", "no", "yes", "no"),
                       levels = c("yes", "no")),
  caloric_food = factor(c("yes", "no", "yes", "no", "yes", "no"),
                        levels = c("yes", "no")),
  vegetable_food = c(1, 5, 2, 4, 1, 3),
  nb_meal_day = c(5, 1, 3, 2, 4, 2),
  food_btw_meals = factor(c("Frequently", "Sometimes", "Always", "Sometimes", "Frequently", "Always"),
                          levels = c("Frequently", "Sometimes", "Always")),
  smoke = factor(c("no", "yes", "no", "yes", "yes", "no"),
                 levels = c("yes", "no")),
  ch2o = c(0.5, 4, 1, 3, 2, 1.5),
  calorie_check = factor(c("yes", "no", "yes", "no", "yes", "no"),
                         levels = c("yes", "no")),
  physical_act = c(0, 5, 1, 3, 0, 4),
  use_tech = c(10, 0.5, 8, 1, 6, 2),
  freq_alcohol = factor(c("Always", "Never", "Sometimes", "Always", "Sometimes", "Never"),
                        levels = c("Sometimes", "Frequently", "Always", "Never")),
  m_trans = factor(c("Public_Transportation", "Walking", "Motorbike", "Bike", "Public_Transportation", "Motorbike"),
                   levels = c("Public_Transportation", "Walking", "Bike", "Motorbike"))
)

predicted_probs_new <- predict(stepwise_model, new_data, type = "response")

probability_table <- tibble(
  Reference = 1:length(predicted_probs_new),
  Predicted_Probability = predicted_probs_new,
  Probability_Percentage = predicted_probs_new * 100
)

probability_table %>%
  kbl(format = "html", caption = "Predicted Probabilities") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, width = "100px") %>%
  scroll_box(width = "100%", height = "400px")
Predicted Probabilities
Reference Predicted_Probability Probability_Percentage
1 0.0000337 0.0033741
2 0.0061064 0.6106392
3 0.0031109 0.3110944
4 0.0311437 3.1143676
5 0.0006396 0.0639565
6 0.0015706 0.1570632
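
The predicted probability for a profile can be traced back to the individual coefficients. The sketch below recomputes the value for the first profile by hand, using the stepwise estimates from the coefficient table and the inputs coded in new_data; the vector names b and x are introduced here for illustration. Terms for reference categories (no smoking, for example) contribute zero.

```r
# Stepwise estimates copied from the coefficient table above
b <- c(
  intercept = -8.7956227, family_histyes = 3.6796264,
  food_btw_mealsFrequently = -2.0244197, caloric_foodyes = 2.1168040,
  vegetable_food = 0.8886854, m_transPublic_Transportation = 0.6015745,
  calorie_checkyes = -2.6730603, physical_act = -0.3462794,
  use_tech = -0.4174300, smokeyes = 1.0211075, ch2o = 0.1679017
)

# Profile 1 as coded in new_data (1 = dummy variable switched on)
x <- c(
  intercept = 1, family_histyes = 1, food_btw_mealsFrequently = 1,
  caloric_foodyes = 1, vegetable_food = 1, m_transPublic_Transportation = 1,
  calorie_checkyes = 1, physical_act = 0, use_tech = 10, smokeyes = 0,
  ch2o = 0.5
)

contributions <- b * x[names(b)]  # per-term contribution to the log-odds
log_odds <- sum(contributions)
probability <- plogis(log_odds)   # inverse logit

print(round(contributions, 3))
print(probability)
```

The result agrees with the tabulated probability for the first profile up to rounding of the coefficients. The breakdown shows that ten hours of technology use alone lowers the log-odds by about 4.17, more than offsetting the positive contribution of family history.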

4.3 Results

All conclusions drawn from the model outputs were based on a threshold of p < 0.05 to determine statistical significance, with highly significant variables (p < 0.001) offering strong evidence of their association with obesity.

Null Model

The null model, serving as a benchmark for comparison with more complex models, included only the intercept and no behavioral or lifestyle predictors. The estimated intercept of -0.133403 is the log-odds of obesity in the sample; it corresponds to a probability of about 46.7%, which matches the observed share of obese individuals (974 of 2087). In the absence of explanatory variables, the model therefore assigns every individual the same probability of obesity, slightly below 50%. The p-value of the intercept (0.0024) indicates only that this baseline probability differs significantly from 50%; it says nothing about the model’s predictive ability.
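
The link between the intercept and the sample share of obese individuals can be checked directly. The short sketch below uses only values reported above (the intercept and the two class counts); the variable names are chosen here for illustration.

```r
# Values reported above: null-model intercept and class counts
intercept <- -0.133403
n_obese <- 974
n_not_obese <- 1113

p_from_intercept <- plogis(intercept)            # inverse logit: 1 / (1 + exp(-x))
p_observed <- n_obese / (n_obese + n_not_obese)  # sample share of obese individuals

print(c(from_intercept = p_from_intercept, observed = p_observed))
```

Both values are approximately 0.467, which is what an intercept-only logistic model is expected to return.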

Full Logistic Regression Model

The results of the full logistic regression model highlight significant statistical relationships between key variables and the likelihood of obesity. Among the statistically significant predictors, family history of being overweight showed the largest coefficient (coefficient = 3.681, p < 0.001), which is consistent with an important role for genetic predisposition, although family history may also capture shared household habits.

Dietary habits were also prominently associated with obesity. Frequent consumption of high-calorie foods (caloric_food) showed a strong positive effect (coefficient = 2.073, p < 0.001), reinforcing the role of energy-dense diets in driving weight gain. Interestingly, vegetable consumption (vegetable_food) was also positively associated (coefficient = 0.875, p < 0.001). While unexpected, this may reflect broader dietary patterns, where higher vegetable intake coexists with other risk behaviors in certain populations. Snacking behavior revealed complex relationships; each coefficient is measured relative to the omitted reference category, which is always eating between meals. Frequent eating between meals (food_btw_mealsFrequently) was negatively associated with obesity (coefficient = -2.041, p = 0.0006), suggesting that regular snacking might displace larger, energy-dense meals or align with healthier dietary practices. Conversely, occasional snacking (food_btw_mealsSometimes) was positively associated with obesity (coefficient = 1.273, p = 0.0054), which may reflect irregular or opportunistic consumption of unhealthy snacks.

Behavioral and lifestyle factors further enriched the analysis. Smoking (smoke) was positively associated with obesity (coefficient = 1.049, p = 0.023), potentially indicating correlations with sedentary behaviors or poor dietary choices often linked to smoking habits. Transportation mode, measured relative to travelling by automobile (the reference category), had contrasting effects: reliance on public transportation (m_transPublic_Transportation) was positively associated (coefficient = 0.571, p < 0.001), while walking as a primary mode (m_transWalking) was strongly protective (coefficient = -1.905, p = 0.0035). These findings underscore the importance of physical activity embedded in daily routines as a key determinant of obesity risk.

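Some predictors, such as water intake (ch2o, p = 0.106) and the number of daily meals (nb_meal_day, p = 0.631), were not statistically significant. In contrast, physical activity frequency (physical_act) was negatively associated with obesity (coefficient = -0.328, p < 0.001), as were calorie monitoring (coefficient = -2.659, p < 0.001) and technology use (coefficient = -0.393, p < 0.001). The alcohol-frequency coefficients have very large standard errors (about 325) and are not significant, which suggests that at least one alcohol category contains too few observations for stable estimation. Overall, the model captured the multifaceted nature of obesity, highlighting the interplay between family history, dietary behaviors, and lifestyle choices. These results provide a foundation for refining predictive models and for developing targeted interventions.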

Stepwise Logistic Regression Model

The stepwise logistic regression model was developed to identify the most influential predictors of obesity while achieving a balance between complexity and interpretability. Using the Akaike Information Criterion (AIC) to iteratively add or remove variables, the procedure dropped the number of daily meals (nb_meal_day) and alcohol frequency (freq_alcohol) and retained the remaining ten predictors. Because AIC balances fit against complexity rather than testing significance, a retained predictor need not be significant at the 5% level; water intake (ch2o), for example, is retained with p = 0.091. By concentrating on a refined set of predictors, the model enhances both interpretability and practical utility, providing insights for targeted interventions and public health policies.

Among the retained predictors, family history of overweight emerged again as the most significant variable, with a coefficient of 3.680 (p < 0.001). This finding underscores the strong association between family history and obesity, consistent with the notion that familial trends play a central role in determining obesity risk. Similarly, frequent consumption of high-calorie foods was strongly linked to obesity, with a coefficient of 2.117 (p < 0.001). Relative to always snacking (the reference category), frequent snacking between meals had a significant negative association with obesity (coefficient = -2.024, p = 0.0006), suggesting that certain structured eating behaviors may mitigate obesity risk. Conversely, occasional snacking exhibited a positive association with obesity (coefficient = 1.360, p = 0.0028). Modes of transportation, measured relative to automobile use, were also identified as significant predictors. Use of public transportation was positively associated with obesity (coefficient = 0.602, p < 0.001), potentially reflecting a sedentary component in daily commuting. In contrast, walking as a primary mode of transportation was negatively associated with obesity (coefficient = -1.853, p = 0.0044), highlighting the protective role of physical activity integrated into daily routines. Other notable predictors included smoking, which showed a positive association with obesity (coefficient = 1.021, p = 0.025), and calorie monitoring, which demonstrated a significant protective effect (coefficient = -2.673, p < 0.001). Both physical activity (coefficient = -0.346, p < 0.001) and time spent using technology (coefficient = -0.417, p < 0.001) were negatively associated with obesity.
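
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to communicate. The sketch below converts a selection of the stepwise estimates from the coefficient table; the vector name coefs and the choice of predictors shown are made here for illustration.

```r
# Selected estimates copied from the stepwise coefficient table
coefs <- c(
  family_histyes = 3.6796264,
  caloric_foodyes = 2.1168040,
  m_transWalking = -1.8531534,
  calorie_checkyes = -2.6730603,
  physical_act = -0.3462794,
  use_tech = -0.4174300
)

# exp(coefficient) = multiplicative change in the odds of obesity
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio above 1 indicates higher odds of obesity and a value below 1 indicates lower odds, holding the other predictors fixed. For example, each additional unit of physical activity multiplies the odds by roughly 0.71.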

Model Evaluation

The evaluation of the stepwise logistic regression model assesses how well it predicts obesity from the behavioral and lifestyle predictors. Overall accuracy was 75.89%, with a 95% confidence interval from 74.00% to 77.72%, well above the no-information rate of 53.33% obtained by always predicting the majority class (accuracy p-value < 0.001). The kappa statistic of 0.5227 indicates moderate agreement between predicted and actual classifications beyond what chance would produce. The McNemar test p-value (< 0.001) is not a measure of model quality; it indicates that the two types of misclassification occur at significantly different rates (364 versus 139 cases). Because predictions were generated on the same data used to fit the model, these figures are in-sample estimates and may be optimistic.

From a class-level perspective, the sensitivity, or the model’s ability to correctly classify obese individuals, was 67.30%. This suggests that while the model captures the majority of true positive cases, a proportion of obese individuals are still misclassified as non-obese. In contrast, the specificity was higher, at 85.73%, indicating a strong ability to correctly identify non-obese individuals. Precision, which quantifies the proportion of correctly identified obese individuals among those predicted as obese, reached 84.35%, highlighting the model’s reliability in minimizing false positive classifications. The F1 score of 74.86% demonstrates a solid balance between precision and recall, further emphasizing the model’s overall efficacy. Balanced accuracy, which averages sensitivity and specificity, was 76.51%, underscoring the model’s consistent performance across both classes.

The receiver operating characteristic (ROC) curve provides additional validation of the model’s predictive capacity. The area under the curve (AUC) was calculated as 0.856, which reflects a high level of discriminative power. This indicates that the model is highly capable of distinguishing between obese and non-obese individuals, with a strong ability to balance false positive and false negative rates across various decision thresholds.

Taken together, these results confirm the predictive reliability and statistical soundness of the stepwise logistic regression model. The findings suggest that the model can serve as a valuable tool for identifying key risk factors associated with obesity and for making predictions. While the model demonstrates strong performance metrics overall, particularly in terms of specificity and precision, future iterations may aim to enhance sensitivity to minimize the misclassification of obese individuals.

Performance

The stepwise logistic regression model was applied to six hypothetical profiles representing diverse behavioral and lifestyle patterns, yielding predicted probabilities ranging from near zero to slightly above 3%. This narrow, low range is consistent with the model’s conservative behavior, which favors specificity and minimizes false positives. While this supports reliable identification of individuals at high risk, it may underestimate the probabilities for those with moderate or mixed risk factors. The scatter plot of predicted probabilities shows that the model separates obese from non-obese individuals most clearly at the extremes of the probability range. However, the overlap observed in mid-range probabilities indicates that borderline cases are difficult to classify with confidence.

A notable strength of the model is its interpretability and its reliance on behavioral and lifestyle predictors, which are directly actionable in public health contexts. Its high specificity makes it well suited for targeted interventions, reducing the risk of over-diagnosis and unnecessary allocation of resources. However, this conservatism comes at the cost of potentially overlooking individuals at intermediate risk levels. Additionally, while the selected predictors are practical and relevant, the exclusion of psychosocial and environmental factors, such as socioeconomic status or urban versus rural living conditions, limits the model’s applicability in settings where these variables are significant contributors to obesity.
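The scoring of a profile and the role of the decision threshold can be sketched as follows. The intercept, coefficients, variable names, and profile are all hypothetical and chosen only for illustration. They are not the fitted values of the study's model. The 0.5 default threshold is also an assumption.

```python
import math


def predict_probability(intercept, coefficients, profile):
    """Logistic model: apply the sigmoid to the linear predictor."""
    linear = intercept + sum(
        weight * profile.get(name, 0.0) for name, weight in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, threshold=0.5):
    """Label a profile as obese when its probability reaches the threshold."""
    return probability >= threshold


# HYPOTHETICAL values, not the fitted model from this study.
intercept = -2.0
coefficients = {
    "family_history": 1.5,
    "high_calorie_food": 1.0,
    "physical_activity": -0.5,
}
profile = {"family_history": 1, "high_calorie_food": 1, "physical_activity": 2}

probability = predict_probability(intercept, coefficients, profile)
print(f"predicted probability: {probability:.3f}")
print("obese at threshold 0.5:", classify(probability))
print("obese at threshold 0.3:", classify(probability, threshold=0.3))
```

The same intermediate-risk profile changes class when the threshold is lowered. This is the mechanism by which a stricter threshold gains specificity at the expense of sensitivity.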

Overall, the stepwise logistic regression model provides a reliable and practical framework for predicting obesity and guiding public health strategies. Its strengths in specificity and precision are clear, but future improvements should focus on enhancing sensitivity and expanding the range of predictors to improve its performance across diverse populations and more complex real-world scenarios. These refinements will ensure the model’s continued relevance and utility in addressing the multifaceted nature of obesity.

5 Conclusion

5.1 Achievements

Key Factors Contributing to Obesity

This study offers a detailed analysis of the lifestyle and behavioral determinants of obesity within the context of Mexico, Peru, and Colombia, yielding significant insights into the interplay of individual habits and environmental factors. The analysis identified family history of obesity as the most significant determinant, underscoring the combined influence of genetic predisposition and shared familial behaviors. This finding aligns with broader epidemiological evidence suggesting that intergenerational patterns, encompassing both biological inheritance and familial lifestyle habits, substantially influence obesity risk.

Dietary practices were also particularly salient. As expected, the frequent consumption of high-calorie foods was positively associated with obesity, consistent with its well-documented role as a critical risk factor. Conversely, structured snacking appeared to exert a protective effect, suggesting that planned and balanced snack consumption might mitigate the risks posed by irregular, energy-dense eating patterns. These results indicate that not all snacking behaviors are equally detrimental, a nuance that may inform targeted dietary interventions.

Levels of physical activity emerged as a pivotal determinant, with higher activity levels associated with lower obesity risk. In this context, active transportation, including walking and cycling, demonstrated a particularly robust protective effect. Sedentary behaviors, exemplified by prolonged technology use, were strongly associated with higher obesity prevalence, likely reflecting a shift toward increasingly sedentary lifestyles in urbanized regions. Finally, inadequate vegetable consumption and low water intake compounded obesity risk, highlighting the cumulative effect of dietary quality and hydration on weight regulation.

These findings collectively illustrate the multifaceted nature of obesity within the given dataset, driven by an intricate interplay of hereditary, behavioral, and environmental factors. The results, while not directly generalizable to real-world settings because the data are synthetic and the sample is not representative, emphasize the need for public health interventions tailored to the distinct sociocultural and environmental contexts of these regions.

Predictive Modeling for Obesity Risk

The study’s predictive modeling efforts centered on a logistic regression framework, systematically optimized through stepwise selection using the Akaike Information Criterion (AIC). This approach identified a subset of variables that most effectively predicted obesity outcomes while maintaining model parsimony and interpretability. The model achieved an overall accuracy of 75.89% and an AUC of 0.856, signifying strong performance in distinguishing between obese and non-obese individuals. High specificity (85.73%) indicated the model’s strength in correctly identifying non-obese individuals, while its sensitivity (67.30%) reflected moderate success in detecting true positive cases of obesity.
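The logic of AIC-based selection can be sketched in a few lines. The example below shows a single backward-elimination step. The direction of the stepwise search is an assumption, since it is not specified here. The predictor names and log-likelihood values are hypothetical stand-ins for fitted models, and `fit_loglik` is an illustrative callable rather than a function from any particular library.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


def backward_step(current, fit_loglik):
    """Try dropping each predictor once and keep the subset with the lowest AIC.

    `fit_loglik` maps a tuple of predictor names to the model's log-likelihood.
    The parameter count adds one for the intercept.
    """
    best_set = current
    best_aic = aic(fit_loglik(current), len(current) + 1)
    for name in current:
        candidate = tuple(p for p in current if p != name)
        candidate_aic = aic(fit_loglik(candidate), len(candidate) + 1)
        if candidate_aic < best_aic:
            best_set, best_aic = candidate, candidate_aic
    return best_set, best_aic


# HYPOTHETICAL log-likelihoods standing in for fitted logistic models.
toy_loglik = {
    ("a", "b", "c"): -100.0,
    ("b", "c"): -108.0,
    ("a", "c"): -100.4,
    ("a", "b"): -103.0,
}

best_set, best_aic = backward_step(("a", "b", "c"), toy_loglik.__getitem__)
print(best_set, round(best_aic, 1))
```

In this toy case, dropping `b` barely reduces the log-likelihood, so the saving of one parameter outweighs the loss of fit and the smaller model is preferred. Repeating the step until no removal lowers the AIC yields the parsimonious subset described above.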

Key predictors included family history of obesity, dietary habits (notably, high-calorie food consumption and structured snacking), physical activity levels, and transportation modes. The inclusion of modifiable lifestyle variables highlights the model’s utility in informing targeted interventions. For example, individuals with higher obesity risk due to behavioral factors such as low physical activity or frequent consumption of high-calorie foods could be directed toward tailored public health programs aimed at mitigating these risks.

The model’s integration of behavioral data and its predictive accuracy suggest practical applications in population health management. By identifying individuals at elevated risk, resources can be prioritized toward preventive measures, including nutritional education, physical activity promotion, and family-centered interventions. Furthermore, the ability to quantify obesity risk based on modifiable factors provides actionable insights for developing evidence-based health policies.

5.2 Final Thoughts and Proposals

This study has provided valuable insights into the determinants of obesity in Mexico, Peru, and Colombia by identifying critical behavioral and lifestyle factors. However, addressing certain limitations will be essential to strengthen the rigor and expand the scope of future research efforts. The exclusion of critical psychosocial, environmental, and cultural determinants—such as socioeconomic status, urban versus rural living conditions, and dietary norms—limits the explanatory power of the current model. Future studies should incorporate these variables to capture the complex interplay between contextual factors and obesity risk, thereby improving predictive accuracy. Additionally, the strong association between family history and obesity highlights the need to investigate intergenerational behavioral patterns and family dynamics, particularly as they relate to shared environmental and genetic factors. Longitudinal research is essential to disentangle causality, assess temporal trends, and evaluate the sustained impact of interventions, allowing for a more comprehensive analysis of obesity progression over time.

From a public health perspective, the findings emphasize the necessity for integrated, multi-level interventions addressing both individual behaviors and broader systemic factors. Policies should prioritize active transportation initiatives, such as investments in pedestrian and cycling infrastructure, alongside public awareness campaigns to encourage physical activity as a routine part of daily life. For individuals with a family history of obesity, tailored interventions, including genetic risk counseling, family-based behavioral therapies, and parent-child lifestyle modification programs, should be prioritized to address both hereditary predispositions and modifiable risk factors.

Nutritional education efforts must target high-calorie food consumption, emphasizing balanced diets and structured snacking habits. Fiscal measures such as taxation of sugary beverages and processed foods, coupled with subsidies for fresh produce, represent economically viable approaches to incentivize healthier eating. Family-focused meal planning workshops and culturally sensitive dietary guidelines would further enhance the effectiveness of these interventions, directly addressing both immediate dietary behaviors and the intergenerational transmission of risk factors.

At the community level, expanding access to physical activity opportunities through well-maintained parks and public exercise facilities is critical in mitigating sedentary behaviors identified in the analysis. Initiatives such as family-friendly sports leagues or community exercise classes could foster collective engagement in healthier lifestyles. Workplace wellness programs promoting active breaks and nutritious meal options provide another avenue for reducing obesity prevalence. Furthermore, leveraging technology, such as culturally tailored mobile applications for dietary tracking, physical activity monitoring, and behavior modification, could empower individuals to adopt healthier routines. Incorporating family history as a personalized variable in these tools could help users understand their hereditary risks, facilitating tailored behavior change strategies.

These findings provide a compelling foundation for evidence-based public health policies targeting obesity in Mexico, Peru, and Colombia. By combining data-driven insights with culturally tailored interventions, this study supports the development of a comprehensive framework for addressing the multifaceted challenges of obesity. Future research should broaden the scope to include psychosocial and environmental determinants, employ longitudinal designs, and evaluate the scalability of interventions across diverse populations. Emphasizing early, family-centered approaches will be crucial in interrupting the cycle of intergenerational obesity, thereby promoting sustained health improvements for future generations.